Speech recognition by dynamical noise model adaptation

ABSTRACT

The invention provides a Hidden Markov Model ( 132 ) based automated speech recognition system ( 100 ) that dynamically adapts to changing background noise by detecting long pauses in speech, and for each pause processing background noise during the pause to extract a feature vector that characterizes the background noise, identifying a Gaussian mixture component of noise states that most closely matches the extracted feature vector, and updating the mean of the identified Gaussian mixture component so that it more closely matches the extracted feature vector, and consequently more closely matches the current noise environment. Alternatively, the process is also applied to refine the Gaussian mixtures associated with other emitting states of the Hidden Markov Model.

FIELD OF THE INVENTION

[0001] This invention pertains to automated speech recognition. More particularly, this invention pertains to speaker independent speech recognition suitable for varied background noise environments.

BACKGROUND OF THE INVENTION

[0002] Recently, as the processing power of portable electronic devices has increased, there has been an increased interest in adding speech recognition capabilities to such devices. Wireless telephones that are capable of operating under the control of voice commands have been introduced into the market. Speech recognition has the potential to decrease the effort and attention required of users operating wireless phones. This is especially advantageous for users that are frequently engaged in other critical activities (e.g., driving) while operating their wireless phones.

[0003] The most widely used algorithms for performing automated speech recognition (ASR) are based on Hidden Markov Models (HMM). In an HMM ASR, speech is modeled as a sequence of states. These states are assumed to be hidden; only output based on the states, i.e., speech, is observed. According to the model, transitions between these states are governed by a matrix of transition probabilities. For each state there is an output function, specifically a probability density function that determines an a posteriori probability that the HMM was in the state, given measured features of an acoustic signal. The matrix of transition probabilities and the parameters of the output functions are determined during a training procedure which involves feeding known words and/or sentences into the HMM ASR and fine tuning the transition probabilities and output function parameters to achieve optimized recognition performance.

[0004] In order to accommodate the variety of accents and other variations in the way words are pronounced, spoken messages to be identified using an HMM ASR system are processed in such a manner as to extract feature vectors that characterize successive periods of the spoken message.

[0005] In performing ASR, a most likely sequence of the states of the HMM is determined in view of the transition probability for each transition in the sequence, the extracted feature vectors, and the a posteriori probabilities associated with the states.

[0006] Background noise, which predominates during pauses in speech, is also modeled by one or more states of the HMM model so that the ASR will properly identify pauses and not try to construe background noise as speech.

[0007] One problem for ASR systems, particularly those used in portable devices, is that the characteristics of the background noise in the environment of the ASR system are not fixed. If an ASR system is trained in an acoustic environment where there is no background noise, or in an acoustic environment with one particular type of background noise, the system will be prone to making errors when operated in an environment with background noise of a different type. Background noise that is unfamiliar to the ASR system may be construed as parts of speech.

[0008] What is needed is an ASR system that can achieve high rates of speech recognition when operated in environments with different types of background noise.

[0009] What is needed is an ASR system that can adapt to different types of background noise.

BRIEF DESCRIPTION OF THE DRAWINGS

[0010] The features of the invention believed to be novel are set forth in the claims. The invention itself, however, may be best understood by reference to the following detailed description of certain exemplary embodiments of the invention, taken in conjunction with the accompanying drawings in which:

[0011] FIG. 1 is a functional block diagram of a system for performing automated speech recognition according to the preferred embodiment of the invention.

[0012] FIG. 2 is a flow chart of a process for updating a model of background noise according to the preferred embodiment of the invention.

[0013] FIG. 3 is a high level flow chart of a process of performing automated speech recognition using a Hidden Markov Model.

[0014] FIG. 4 is a first part of a flow chart of a process for extracting feature vectors from an audio signal according to the preferred embodiment of the invention.

[0015] FIG. 5 is a second part of the flow chart begun in FIG. 4.

[0016] FIG. 6 is a hardware block diagram of the system for performing automated speech recognition according to the preferred embodiment of the invention.

DETAILED DESCRIPTION OF THE PREFERRED EMBODIMENT

[0017] While this invention is susceptible of embodiment in many different forms, there are shown in the drawings and will herein be described in detail specific embodiments, with the understanding that the present disclosure is to be considered as an example of the principles of the invention and not intended to limit the invention to the specific embodiments shown and described. Further, the terms and words used herein are not to be considered limiting, but rather merely descriptive. In the description below, like reference numbers are used to describe the same, similar, or corresponding parts in the several views of the drawings.

[0018] FIG. 1 is a functional block diagram of a system 100 for performing automated speech recognition according to the preferred embodiment of the invention. Audio signals from a transducer (e.g., microphone, not shown) are input at an input 102 of an audio signal sampler 104. The audio signal sampler 104 preferably samples the audio signal at a sampling rate of about 8,000 to 16,000 samples per second and at 8 to 16 bit resolution and outputs a representation of the input audio signal that is discretized in time and amplitude. The audio signals may be represented as a sequence of binary numbers:

X_(n), n=0 . . . N,

[0019] where

[0020] X_(n) is an nth indexed digitized sample, and

[0021] the index n ranges up to a limit N determined by the length of the audio signal.

[0022] A Finite Impulse Response (FIR) time domain filter 106 is coupled to the audio signal sampler 104 for receiving the discretized audio signal. The FIR filter 106 serves to increase the magnitude of high frequency components compared to low frequency components of the discretized audio signal. The FIR time domain filter 106 processes the discretized audio signal and outputs a sequence of filtered discretized samples at the sampling rate. Each nth filter output may be expressed as: $X_{n}^{l} = \sum_{k=0}^{M} C_{k} X_{n-k}$

[0023] where

[0024] X^(l) _(n) is an nth time domain filtered output,

[0025] C_(k) is a kth FIR time domain filter coefficient,

[0026] M is one less than the number of FIR time domain coefficients; and

[0027] X_(n−k) is an indexed digitized sample received from the audio signal sampler 104.

[0028] Preferably, M is equal to 1, C₀ is about equal to unity and C₁ is about equal to negative 0.95. Other suitable filter functions may be used for pre-emphasizing high frequency components of the discretized audio signal.
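
By way of illustration, the following is a minimal sketch, in Python using NumPy, of the two-tap pre-emphasis FIR filter described above. The function name, the use of NumPy, and the zero value assumed for the sample preceding the first one are illustrative assumptions rather than requirements of the invention.

```python
import numpy as np

def pre_emphasize(samples: np.ndarray, c0: float = 1.0, c1: float = -0.95) -> np.ndarray:
    """Apply the two-tap FIR pre-emphasis filter X'_n = c0*X_n + c1*X_(n-1).

    With c0 about equal to unity and c1 about equal to -0.95 (the values
    suggested above), high frequency components of the discretized audio
    signal are boosted relative to low frequency components.
    """
    samples = np.asarray(samples, dtype=np.float64)
    # Prepend one zero so the first output uses X_(-1) = 0 (an assumption).
    delayed = np.concatenate(([0.0], samples[:-1]))
    return c0 * samples + c1 * delayed
```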

[0029] A windower 108 is coupled to the FIR filter 106 for receiving the filtered discretized samples. The windower 108 multiplies successive subsets of filtered discretized samples by a discretized representation of a window function. For example, each subset, which is termed a frame, may comprise about 25 to 30 ms of speech (about 200 to 480 samples). Preferably, there is about a 15-20 ms overlap between two successive frames. Each filtered discretized sample in each frame is multiplied by a specific coefficient of the window function that is determined by the position of the filtered discretized sample in the window. The windower 108 preferably outputs frames of windowed filtered speech samples at an average rate equal to the inverse of the difference between the length of each frame and the overlap between frames. Each windowed filtered sample within a frame may be denoted:

X_(n)^(F) = X_(n)^(l) W_(n)

[0030] where

[0031] the index n now denotes position within a frame;

[0032] the index F denotes a frame number;

[0033] X_(n)^(F) is an nth windowed filtered sample; and

[0034] W_(n) is a window coefficient corresponding to the nth position within each frame.

[0035] Applying the windowing function to the discretized audio signal aids in reducing spectral overlap between adjacent frequency components that are output by a Fast Fourier Transform (FFT) 110. A Hamming window function is preferred.
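
The framing and windowing performed by the windower 108 may be sketched as follows. This is an illustrative Python/NumPy sketch; the exact frame length, overlap, and the use of numpy.hamming are assumptions chosen within the ranges given above.

```python
import numpy as np

def frame_and_window(filtered: np.ndarray,
                     sample_rate: int = 8000,
                     frame_ms: float = 25.0,
                     overlap_ms: float = 15.0) -> np.ndarray:
    """Split the filtered samples into overlapping frames and apply a Hamming window.

    Frame length and overlap follow the ranges given above (25-30 ms frames,
    15-20 ms overlap); the specific default values here are illustrative.
    Returns an array of shape (number_of_frames, samples_per_frame).
    """
    frame_len = int(sample_rate * frame_ms / 1000)            # samples per frame
    step = frame_len - int(sample_rate * overlap_ms / 1000)   # frame advance
    window = np.hamming(frame_len)                            # W_n coefficients
    n_frames = max(0, 1 + (len(filtered) - frame_len) // step)
    frames = np.empty((n_frames, frame_len))
    for f in range(n_frames):
        start = f * step
        # X^F_n = X'_n * W_n for each position n within frame f.
        frames[f] = filtered[start:start + frame_len] * window
    return frames
```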

[0036] The FFT 110 is coupled to the windower 108 for receiving the successive frames of windowed filtered samples. The FFT 110 projects successive frames of windowed filtered discretized audio signal samples onto a Fourier frequency domain basis to obtain a plurality of audio signal Fourier frequency components, and processes the Fourier frequency components to determine a set of power Fourier frequency components for each frame. The FFT 110 outputs a sequence of power Fourier components. The power Fourier components are given by the following relations:

$P(0) = \frac{1}{N^{2}}\left| C_{0} \right|^{2}$

$P\left( f_{k} \right) = \frac{1}{N^{2}}\left[ \left| C_{k} \right|^{2} + \left| C_{N-k} \right|^{2} \right],\quad k = 1,\ldots,\frac{N}{2}-1$

$P\left( f_{N/2} \right) = \frac{1}{N^{2}}\left| C_{N/2} \right|^{2}$

[0037] where,

[0038] P(0) is a zero order power Fourier frequency component (equal to an average of the power of a frame);

[0039] P(f_(k)) is a kth power Fourier frequency component of the frame;

[0040] N is the number of samples per frame; and

$C_{k} = \sum_{n=0}^{N-1} X_{n}^{F}\, e^{-2\pi i n k / N},\quad k = 0,\ldots,N-1$

[0041] where

[0042] C_(k) is a kth Fourier frequency component;

[0043] i is the square root of negative one;

[0044] n is a summation index;

[0045] N is the number of samples per frame.
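
A minimal sketch of the computation of the power Fourier frequency components defined above is given below. It assumes an even frame length N and uses NumPy's FFT routine; these choices are illustrative.

```python
import numpy as np

def power_spectrum(frame: np.ndarray) -> np.ndarray:
    """Compute the power Fourier frequency components P(0)..P(f_{N/2}) of one frame.

    Follows the relations above: P(0) = |C_0|^2 / N^2,
    P(f_k) = (|C_k|^2 + |C_{N-k}|^2) / N^2 for k = 1..N/2-1, and
    P(f_{N/2}) = |C_{N/2}|^2 / N^2, where C_k are the DFT coefficients of the
    windowed, filtered samples X^F_n. Assumes N is even.
    """
    N = len(frame)
    C = np.fft.fft(frame)                       # C_k, k = 0 .. N-1
    P = np.empty(N // 2 + 1)
    P[0] = np.abs(C[0]) ** 2 / N ** 2
    for k in range(1, N // 2):
        P[k] = (np.abs(C[k]) ** 2 + np.abs(C[N - k]) ** 2) / N ** 2
    P[N // 2] = np.abs(C[N // 2]) ** 2 / N ** 2
    return P
```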

[0046] A MEL scale filter bank 112 is coupled to the FFT 110 for receiving the power Fourier frequency components. The MEL scale filter bank includes a plurality of MEL scale band pass filters 112A, 112B, 112C, 112D (four of which are shown). Each MEL scale band pass filter preferably is a weighted sum of a plurality of power Fourier frequency components. The MEL scale band pass filters 112A-112D preferably have a triangular profile in the frequency domain. Alternatively, the MEL scale bandpass filters 112A-112D have a Hamming or Hanning frequency domain profile. Each MEL bandpass filter 112A-112D preferably integrates a plurality of power Fourier frequency components into a MEL scale frequency component. By integrating plural power Fourier frequency components with the MEL bandpass filters 112A-112D, the dimensionality of the audio signal information is reduced. The MEL scale bands are chosen in view of understood characteristics of human acoustic perception. There are preferably about 10 evenly spaced MEL scale bandpass filters below 1 kHz. Beyond 1 kHz the bandwidth of successive MEL frequency bandpass filters preferably increases by a factor of about 1.2. There are preferably about 10 to 20 MEL scale bandpass filters above 1 kHz, and more preferably about 14. The MEL scale filter bank 112 outputs a plurality of MEL scale frequency components. An mth MEL scale frequency component of the MEL scale filter bank 112 corresponding to an mth MEL bandpass filter is denoted Z(m).
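
The following sketch illustrates one possible construction of the MEL scale filter bank 112 consistent with the description above: triangular filters, roughly evenly spaced below 1 kHz, with bandwidths growing by a factor of about 1.2 above 1 kHz. The band-edge construction and the parameter defaults are illustrative assumptions; practical systems often derive the band edges from a mel-frequency formula instead.

```python
import numpy as np

def mel_filterbank_outputs(P: np.ndarray, sample_rate: int = 8000,
                           n_linear: int = 10, n_log: int = 14,
                           growth: float = 1.2) -> np.ndarray:
    """Integrate the power Fourier components P into MEL scale components Z(m).

    Builds triangular band pass filters: band edges are evenly spaced up to
    1 kHz, then widen by the factor `growth` (about 1.2, per the text).
    Each filter output is a weighted sum of power Fourier frequency components.
    """
    nyquist = sample_rate / 2.0
    # Band edges: linear up to 1 kHz, then geometrically widening bands.
    edges = list(np.linspace(0.0, 1000.0, n_linear + 1))
    width = edges[1] - edges[0]
    while len(edges) < n_linear + n_log + 2 and edges[-1] < nyquist:
        width *= growth
        edges.append(min(edges[-1] + width, nyquist))
    edges = np.array(edges)
    freqs = np.linspace(0.0, nyquist, len(P))     # frequency of each P[k]
    Z = []
    for m in range(1, len(edges) - 1):
        lo, center, hi = edges[m - 1], edges[m], edges[m + 1]
        # Triangular weight: rises from lo to center, falls from center to hi.
        rise = np.clip((freqs - lo) / (center - lo), 0.0, 1.0)
        fall = np.clip((hi - freqs) / (hi - center), 0.0, 1.0)
        Z.append(np.sum(np.minimum(rise, fall) * P))
    return np.array(Z)
```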

[0047] A log-magnitude evaluator 114 is coupled to the MEL scale frequency filter bank 112 for applying a composite function to each MEL scale frequency component. The composite function comprises taking the magnitude of each MEL scale frequency component, and taking the log of the result. By taking the magnitude of each MEL scale frequency component, phase information, which does not encode speech information, is discarded. By discarding phase information, the dimensionality of acoustic signal information is further reduced. By taking the log of the resulting magnitude, the magnitudes of the MEL scale frequency components are put on a scale which more accurately models the response of human hearing to changes in sound intensity. The log-magnitude evaluator 114 outputs a plurality of rescaled magnitudes of the MEL scale frequency components of the form log(|Z(m)|).

[0048] A discrete cosine transform block (DCT) 116 is coupled to the log-magnitude evaluator 114 for receiving the rescaled magnitudes. The DCT 116 transforms the rescaled magnitudes to the time domain. The output of the DCT 116 comprises a set of DCT component values (cepstral coefficients) for each frame. The zero order component output by the DCT is proportional to the log energy of the acoustic signal during the frame from which the component was generated. The DCT components output by the DCT 116 are preferably of the following form:

$y^{P}(k) = \sum_{m=1}^{M} \log\left( \left| Z(m) \right| \right)\cos\left( k\left( m - \frac{1}{2} \right)\frac{\pi}{M} \right)$

[0049] where

[0050] y^(P)(k) is a kth order DCT component output by the DCT 116 for a pth frame; and

[0051] M in this case is the number of MEL scale frequency components.

[0052] The summation on the right hand side of the above equation effects the DCT transformation. The DCT components are also termed cepstrum coefficients.
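
A compact sketch of the log-magnitude rescaling and the DCT defined above, producing the cepstrum coefficients y^P(k) for one frame, might look as follows. The small floor added before taking the log and the number of retained coefficients are illustrative assumptions.

```python
import numpy as np

def cepstral_coefficients(Z: np.ndarray, n_coeffs: int = 13) -> np.ndarray:
    """Compute y(k) = sum_{m=1..M} log(|Z(m)|) * cos(k * (m - 1/2) * pi / M).

    Z holds the MEL scale frequency components Z(1)..Z(M) of one frame.
    y(0) is proportional to the log energy of the frame, as noted above.
    """
    M = len(Z)
    log_mag = np.log(np.abs(Z) + 1e-12)   # small floor avoids log(0)
    m = np.arange(1, M + 1)
    return np.array([np.sum(log_mag * np.cos(k * (m - 0.5) * np.pi / M))
                     for k in range(n_coeffs)])
```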

[0053] The windower 108, FFT 110, MEL scale filter bank 112, log-magnitude evaluator 114, and DCT 116 operate in synchronism. The DCT 116 sequentially outputs sets of DCT components corresponding to frames of discretized samples output by the windower 108.

[0054] A first buffer 118 is coupled to the DCT 116 for receiving successive sets of DCT component values. A differencer 120 is coupled to the first buffer 118 for receiving successive sets of DCT component values. The differencer 120 operates on two or more successive sets of component values by taking the difference between corresponding DCT component values from different sets and outputting sets of discrete differences (including one difference for each DCT component) of first and/or higher order, for each frame. The discrete differences characterize the time-wise variation of the DCT component values. The lth order discrete time difference for the pth frame Δ^(l)(y^(P)(k)) applied to the sequence of DCT components is given by the following recursion relations:

Δ^(l)(y^(P)(k)) = Δ^(l−1)(y^(P+1)(k)) − Δ^(l−1)(y^(P−1)(k))

Δ⁰(y^(P)(k)) = y^(P)(k)

[0055] The DCT component values output for each frame by the DCT 116, along with discrete differences of one or more orders, serve to characterize the audio signal during each frame. (The DCT component values and the discrete differences are numbers.) The DCT component values and discrete differences of one or more orders are preferably stored in arrays (one for each frame) and treated as vectors, hereinafter termed feature vectors. Preferably, DCT components and the first two orders of differences are used in the feature vectors. The feature vector for a given frame P is denoted:

Y^(P) = [Y₁^(P), Y₂^(P), Y₃^(P), . . . , Y_(K)^(P), . . . , Y_(D)^(P)]

[0056] where the first K vector elements are DCT components, and the (K+1)th through Dth vector elements are discrete differences of the DCT components.
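
The assembly of feature vectors from the cepstra and their discrete differences, following the recursion relations above, may be sketched as follows; the handling of the first and last frames is an implementation choice not specified in the text, and at least two frames are assumed.

```python
import numpy as np

def feature_vectors(cepstra: np.ndarray, orders: int = 2) -> np.ndarray:
    """Assemble feature vectors Y^P from per-frame cepstra and their deltas.

    `cepstra` has shape (n_frames, n_coeffs). For each order l, the discrete
    difference follows the recursion above: the l-th order delta at frame P is
    the (l-1)-th order delta at frame P+1 minus that at frame P-1.
    """
    parts = [cepstra]
    current = cepstra
    for _ in range(orders):
        delta = np.zeros_like(current)
        delta[1:-1] = current[2:] - current[:-2]    # interior frames
        delta[0] = current[1] - current[0]          # crude edge handling (assumption)
        delta[-1] = current[-1] - current[-2]
        parts.append(delta)
        current = delta
    # Concatenate cepstra and deltas of each order into one vector per frame.
    return np.concatenate(parts, axis=1)
```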

[0057] According to an alternative embodiment, the differencer 120 is eliminated, and only the DCT components are used to characterize the audio signal during each frame.

[0058] The first buffer 118 and the differencer 120 are coupled to a second buffer 122. The feature vectors are assembled and stored in the second buffer 122.

[0059] The above described functional blocks, including the audio signal sampler 104, FIR time domain filter 106, windower 108, FFT 110, MEL scale filter bank 112, log-magnitude evaluator 114, DCT 116, first buffer 118, differencer 120, and second buffer 122, are parts of a feature extractor 124. The function of the feature extractor 124 is to eliminate extraneous and redundant information from audio signals that include speech sounds, and to produce feature vectors each of which is highly correlated to a particular sound that is one variation of a component of spoken language. Although a preferred structure and operation of the feature extractor 124 has been described above, other types of feature extractors that have different internal structures, and/or operate differently to process audio signals that include speech sounds and produce by such processing characterizations of different sub parts (e.g., frames) of the audio signal, may be used in practicing the invention.

[0060] The second buffer 122 supplies feature vectors for each frame to a Hidden Markov Model (HMM) 132. The HMM 132 models spoken language. The HMM 132 comprises a hierarchy of three interconnected layers of states including an acoustic layer 134, a phoneme layer 136, and a word layer 138. The word layer 138 includes a plurality of states corresponding to a plurality of words in a vocabulary of the HMM. Transitions between states in the word layer are governed by a word layer transition matrix. The word layer transition matrix includes a probability for each possible transition between word states. Some transition probabilities may be zero.

[0061] The phoneme layer 136 includes a word HMM for each word in the word layer 138. Each word HMM includes a sequence of states corresponding to a sequence of phonemes that comprise the word. Transitions between phoneme states within each word HMM are also governed by a matrix of transition probabilities. There may be more than one word HMM for each word in the word layer 138.

[0062] Finally, the acoustic layer 134 includes a phoneme HMM model of each phoneme in the language that the HMM 132 is capable of recognizing. Each phoneme HMM includes beginning states and ending states. A first phoneme HMM model 140 and a second phoneme HMM model 142 are illustrated. In actuality, there are many phoneme HMM models in the acoustic layer 134. The details of phoneme HMM models will be discussed with reference to the first phoneme HMM model 140. A beginning state 140A and an ending state 140D are non-emitting, which is to say that these states 140A, 140D are not associated with acoustic features. Between the beginning and ending states of each phoneme HMM are a number of acoustic emitting states (e.g., 140B, 140C). Although two are shown for the purpose of illustration, in practice there may be more than two emitting states in each phoneme model. Each emitting state of each phoneme HMM model (e.g., 140) is intended to correspond to an acoustically quasi-stationary frame of a phoneme. Transitions between the states in each phoneme model are also governed by a transition probability matrix.

[0063] The acoustic layer also includes an HMM model 156 for the absence of speech sounds that occurs between speech sounds (e.g., between words and between sentences). The model 156 for the absence of speech sounds (background sound model) is intended to correspond to background noise, which predominates in the absence of speech sounds. The background sound model 156 includes a first state 158 that is non-emitting, and a final state 160 that is non-emitting. An emitting state 146 is located between the first 158 and final 160 states. The emitting state 146 represents background sounds. As mentioned above, a difficulty arises in ASR due to the fact that the background noise varies.

[0064] Feature vectors that characterize the audio signal, which are output by the feature extractor 124, are input into the HMM 132 and used within the acoustic layer 134. Each emitting state in the acoustic layer 134 has associated with it a probability density function (PDF) which determines the a posteriori probability that the acoustic state occurred given the feature vector. The emitting states 140B and 140C of the first phoneme HMM have associated probability density functions 144 and 162 respectively. Likewise, the emitting state 146 of the background sound model 156 has a background sound PDF 148. The background sound PDF 148 uses Gaussian mixture component means 150 that are described below.

[0065] The a posteriori probability for each emitting state (including the emitting state 146 in the background sound model 156) is preferably a multi component Gaussian mixture of the form:

$b_{j}\left( Y^{P} \right) = \sum_{n=1}^{M} c_{j}^{n}\, b_{j}^{n}\left( Y^{P} \right)$

[0066] where,

[0067] b_(j)(Y^(P)) is the a posteriori probability that the HMM model 132 was in a jth state during frame P given the fact that the audio signal during frame P was characterized by a feature vector Y^(P);

[0068] c_(j)^(n) is a mixture component weight; and

[0069] b_(j)^(n)(Y^(P)) is an nth mixture component for the jth state that is given by:

$b_{j}^{n}\left( Y^{P} \right) = \frac{1}{\sqrt{\left( 2\pi \right)^{D}\prod_{i=1}^{D}\sigma_{ijn}^{2}}}\exp\left\{ -\frac{1}{2}\sum_{i=1}^{D}\frac{\left( Y_{i}^{P} - \mu_{ijn} \right)^{2}}{\sigma_{ijn}^{2}} \right\}$

[0070] where,

[0071] μ_(ijn) is a mean of an ith parameter (corresponding to an ith element of the feature vectors) of the nth mixture component of the jth acoustic state of the HMM model 132 (for a phoneme or for background sounds).

[0072] σ_(ijn)² is a variance associated with the ith parameter of the nth mixture component of the jth acoustic state of the acoustic layer.

[0073] The means μ_(ijn) serve as reference characterizations of a sound modeled by the a posteriori probability.
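
A direct sketch of evaluating the multi component Gaussian mixture b_j(Y^P) defined above for a single emitting state is shown below; the array shapes and the absence of log-domain arithmetic are simplifications made for illustration.

```python
import numpy as np

def mixture_probability(Y: np.ndarray, weights: np.ndarray,
                        means: np.ndarray, variances: np.ndarray) -> float:
    """Evaluate b_j(Y) = sum_n c_j^n * b_j^n(Y) for one emitting state j.

    `weights` has shape (M,); `means` and `variances` have shape (M, D),
    matching the diagonal-covariance Gaussian mixture defined above.
    """
    D = Y.shape[0]
    total = 0.0
    for c, mu, var in zip(weights, means, variances):
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** D * np.prod(var))
        expo = -0.5 * np.sum((Y - mu) ** 2 / var)
        total += c * norm * np.exp(expo)
    return total
```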

[0074] In operation, a search engine 164 searches the HMM 132 for one or more sequences of states that are characterized by high probabilities, and outputs one or more sequences of words that correspond to the high probability sequences of states. The probability of a sequence of states is determined by the product of the transition probabilities for the sequence of states multiplied by the a posteriori probabilities that the sequence of states occurred. The a posteriori probabilities are obtained by evaluating the a posteriori probability functions associated with a sequence of postulated states using a sequence of feature vectors extracted from the audio signal to be recognized. Expressed mathematically, the probability of a sequence of states S^(1 . . . T) given the fact that a sequence of feature vectors Y^(1 . . . T) was extracted from the audio signal is given by:

$P\left( S^{1\ldots T}, Y^{1\ldots T}, \Theta \right) = \pi_{s_{1}}\, b_{s_{1}}\left( Y^{1} \right)\prod_{t=2}^{T} a_{s_{t-1}s_{t}}\, b_{s_{t}}\left( Y^{t} \right)$

[0075] where

[0076] Θ specifies the underlying HMM model;

[0077] π_(s1) specifies the probability of a first postulated state in the sequence of states;

[0078] a_(s_(t−1)s_(t)) specifies the probability of a transition between a first state postulated for a first time t−1 and a second state postulated for the successive time t; and

[0079] other quantities are defined above.

[0080] Various methods are known to persons of ordinary skill in the ASR art for finding a likely sequence of states without having to exhaustively evaluate the above equation for each possible sequence of states. One known method is the Viterbi search method.
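
For illustration, a minimal log-domain Viterbi search over the probability expression above might be sketched as follows; the flat state space (rather than the layered word/phoneme/acoustic structure of the HMM 132) is a simplifying assumption.

```python
import numpy as np

def viterbi(log_pi: np.ndarray, log_A: np.ndarray, log_B: np.ndarray) -> list:
    """Find the most likely state sequence under the probability model above.

    log_pi: (S,) initial log probabilities pi_{s1}; log_A: (S, S) transition
    log probabilities a_{s,s'}; log_B: (T, S) per-frame emission log
    probabilities log b_s(Y^t). Working in the log domain turns the products
    in the equation above into sums and avoids numerical underflow.
    """
    T, S = log_B.shape
    score = log_pi + log_B[0]                 # pi_{s1} * b_{s1}(Y^1), in logs
    back = np.zeros((T, S), dtype=int)
    for t in range(1, T):
        trans = score[:, None] + log_A        # score of each predecessor/state pair
        back[t] = np.argmax(trans, axis=0)    # best predecessor for each state
        score = trans[back[t], np.arange(S)] + log_B[t]
    # Trace back the highest scoring path.
    path = [int(np.argmax(score))]
    for t in range(T - 1, 0, -1):
        path.append(int(back[t, path[-1]]))
    return list(reversed(path))
```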

[0081] In the HMM 132, transitions from various phoneme states to the model for the absence of speech sounds are allowed. Such transitions often occur at the end of postulated words. Thus, in order to be able to determine the ending of words, and in order to be able to discriminate between short words that sound like the beginning of longer words and the longer words, it is important to be able to recognize background sounds.

[0082] In training an HMM based ASR system that includes a model of non-speech sounds, certain parameters that describe the non speech background sounds must be set. For example, if an a posteriori probability of the form shown above is used, then the mixture component weights, the means μ_(ijn) and the variances σ_(ijn) that characterize background sound must be set during training. As discussed in the background section, the characteristics of the background sound are not fixed. If a portable device that includes an HMM ASR system is taken to different locations, the characteristics of the background sound are likely to change. When the background sound in use differs from that present during training, the HMM ASR is more likely to make errors.

[0083] According to the present invention, a model used in the ASR, preferably the model of non-speech background sounds, is updated frequently while the ASR is in regular use. The model of non-speech background sounds is updated so as to better model current background sounds. According to the present invention, the background sound is preferably measured in the absence of speech sounds, e.g., between words or sentences. According to the preferred embodiment of the invention, the updating takes place during breaks of at least 600 milliseconds, e.g. breaks that occur between sentences.

[0084] According to the preferred embodiment of the invention, the detection of the absence of voiced sounds is premised on the assumption that speech sounds reaching the input 102 of the ASR system 100 have greater power than background sounds. According to the preferred embodiment of the invention, the interruptions in speech sounds between sentences are detected by comparing the zero order DCT coefficient of each frame, which represents the log energy of each frame, to a threshold, and requiring that the zero order DCT coefficient remain below the threshold for a predetermined period. By requiring that the zero order DCT coefficient remain below the threshold for the predetermined period, it is possible to distinguish longer inter sentence breaks in speech sound from shorter intra sentence breaks. According to an alternative embodiment of the invention, an absence of speech sounds is detected by comparing a weighted sum of DCT coefficients to a threshold value. The threshold may be set dynamically based on a running average of the power of the audio signal.

[0085] An inter sentence pause detector 152 is coupled to the DCT 116 for receiving one or more of the coefficients output by the DCT for each frame. Preferably, the inter-sentence pause detector receives the zero order DCT coefficient (log energy value) for each frame. If the zero order DCT coefficient (alternatively, a sum of DCT coefficients, or a weighted sum of the DCT coefficients) remains below a predetermined threshold value for a predetermined time and then goes above the threshold, the inter sentence pause detector 152 outputs a trigger signal. The predetermined time is set to be longer than the average of intra sentence pauses. The trigger signal is output at the end of long (inter sentence) pauses. According to the preferred embodiment of the invention, adjustment of the non speech sound model is based on background sounds that occur near the end of inter sentence breaks in speech sound. Note that the inter sentence pause detector 152 may be triggered after long breaks (e.g., 15 minutes) in speech sounds.
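
The behaviour of the inter sentence pause detector 152, as described above, might be sketched as follows; a fixed threshold and a frame-count criterion are assumed for illustration, whereas the text also allows a dynamically set threshold.

```python
def detect_inter_sentence_pauses(log_energies, threshold, min_pause_frames):
    """Return frame indices at which an inter sentence pause has just ended.

    `log_energies` holds the zero order DCT coefficient y^P(0) of each frame.
    A trigger is produced when the log energy has stayed below `threshold`
    for at least `min_pause_frames` consecutive frames and then rises above
    the threshold again, i.e., at the end of a long pause.
    """
    triggers = []
    quiet_run = 0
    for p, energy in enumerate(log_energies):
        if energy < threshold:
            quiet_run += 1
        else:
            if quiet_run >= min_pause_frames:
                triggers.append(p)   # pause was long enough: trigger at its end
            quiet_run = 0
    return triggers
```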

[0086] A comparer and updater 154 is coupled to the inter-sentence pause detector for receiving the trigger signal. The comparer and updater 154 is also coupled to the second buffer 122 for receiving feature vectors. In response to receiving the trigger signal, the comparer and updater 154 reads from the second buffer 122 one or more feature vectors that were extracted from the end of the inter sentence pause. Preferably, more than one feature vector is read from the second buffer 122 and averaged together element by element to obtain a characteristic feature vector (CRV) that corresponds to at least a portion of the inter sentence pause. Alternatively, a weighted sum of feature vectors from the inter sentence pause is used. Weights used in the weighted sum may be coefficients of a FIR low pass filter. According to another alternative embodiment of the invention, the weighted sum may sum feature vectors extracted from multiple inter sentence pauses (excluding speech sounds between them). Alternatively, one feature vector extracted from the vicinity of the end of the inter sentence pause is used as the characteristic feature vector. Once the characteristic feature vector has been obtained, a mean vector, from among a plurality of mean vectors of one or more emitting states of the background sound model, that is closest to the characteristic feature vector is determined. The closest mean is denoted

μ_(jn)^(*) = [μ_(1jn), μ_(2jn), μ_(3jn), . . . , μ_(ijn), . . . , μ_(Djn)]

[0087] The closest mean belongs to an nth mixture component of a jthstate.

[0088] Closeness is preferably judged by determining which mixture component assumes the highest value when evaluated using the characteristic feature vector. Alternatively, closeness is judged by determining which mean vector μ_(jn) yields the highest dot product with the characteristic feature vector. According to another alternative, closeness is judged by evaluating the Euclidean vector norm distance between the characteristic feature vector and each mean vector μ_(jn) and determining which distance is smallest. The invention is not limited to any particular way of determining the closeness of the characteristic feature vector to the mean vectors μ_(jn) of the Gaussian mixture components. Once the closest mean vector is identified, the mixture component with which it is associated is altered so that it yields a higher a posteriori probability when evaluated with the characteristic feature vector. Preferably, the latter is accomplished by altering the identified closest mean vector so that it is closer to the characteristic feature vector. More preferably, the alteration of the identified closest mean vector μ_(jn)^(*) is performed using the following transformation equation:

μ_(jn)^(new) = (1−α)μ_(jn)^(*) + α·CRV

[0089] where

[0090] μ_(jn)^(new) is a new mean vector to replace the identified closest mean vector μ_(jn)^(*);

[0091] α is a weighting parameter that is preferably at least about 0.7 and more preferably at least about 0.9; and

[0092] CRV is the characteristic feature vector for non speech background sounds as measured during the inter sentence pause.
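
Putting the comparison and update steps together, a minimal sketch of the comparer and updater 154 might look as follows. The highest-mixture-value criterion for closeness and the in-place update of the mean array are illustrative choices; the dot product or Euclidean distance criteria mentioned above could be substituted.

```python
import numpy as np

def adapt_background_mean(crv: np.ndarray, means: np.ndarray,
                          weights: np.ndarray, variances: np.ndarray,
                          alpha: float = 0.9) -> int:
    """Move the background-noise mixture mean closest to the CRV toward the CRV.

    `means`, `variances` have shape (M, D) and `weights` shape (M,) for the
    background sound emitting state. Closeness is judged here by which mixture
    component assumes the highest value when evaluated at the characteristic
    feature vector; the chosen mean is then replaced according to
    mu_new = (1 - alpha) * mu* + alpha * CRV. Returns the adapted component index.
    """
    D = crv.shape[0]
    scores = []
    for c, mu, var in zip(weights, means, variances):
        norm = 1.0 / np.sqrt((2.0 * np.pi) ** D * np.prod(var))
        scores.append(c * norm * np.exp(-0.5 * np.sum((crv - mu) ** 2 / var)))
    n = int(np.argmax(scores))
    means[n] = (1.0 - alpha) * means[n] + alpha * crv   # update in place
    return n
```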

[0093] Thus, as a user continues to use the ASR system 100 and the background sounds in the environment of the ASR system 100 change, the system 100 will continue to update one or more of the means of the Gaussian mixtures of the non speech sound emitting state, so that at least one component of the Gaussian mixtures better matches the ambient noise. The ASR system 100 will thus be better able to identify background noise, and the likelihood of the ASR system 100 construing background noise as a speech phoneme will be reduced. Ultimately, the recognition performance of the ASR system is improved.

[0094] The ASR system 100 may be implemented in hardware or software ora combination of the two.

[0095] FIG. 2 is a flow chart of a process 200 for updating a model of background noise according to the preferred embodiment of the invention. Referring to FIG. 2, in process block 202 an HMM ASR process is run on an audio signal that includes speech and non speech background sounds. A decision block then determines whether a long pause in the speech component of the audio signal has been detected. If a long pause is not detected, the process 200 loops back to block 202 and continues to run the HMM ASR process. If a long pause is detected, the process continues with process block 206, in which a characteristic feature vector that characterizes the audio signal during the long pause (i.e., characterizes the background sound) is extracted from the audio signal. After process block 206, in process block 208 a particular mean of a multi-component Gaussian mixture that is used to model non speech background sounds, and that is closest to the characteristic feature vector extracted in block 206, is found. In process block 210 the particular mean found in process block 208 is updated so that it is closer to the characteristic feature vector extracted in block 206. From block 210 the process 200 loops back to block 202.

[0096] FIG. 3 is a high level flow chart of a process 300 of performing automated speech recognition using an HMM. FIG. 3 is a preferred form of block 202 of FIG. 2. In process block 302, for each successive increment of time (frame), a feature vector that characterizes an audio signal is extracted. In process block 304, for each successive increment of time, the feature vector is used to evaluate Gaussian mixtures that give the a posteriori probabilities that various states of the HMM resulted in an audio signal characterized by the feature vector. In process block 306 the most probable sequence of HMM states is determined in view of the a posteriori probabilities and the transition probabilities that govern transitions between the HMM states. For each subsequent frame, i.e., as speech continues to be processed, the most probable sequence of HMM states is updated. A variety of methods of varying computational complexity are known to persons of ordinary skill in the ASR art for finding the most probable sequence of HMM states.

[0097] FIG. 4 is a first part of a flow chart of a process 400 for extracting feature vectors from an audio signal according to the preferred embodiment of the invention. FIGS. 4 and 5 show a preferred form of block 302 of FIG. 3. In step 402 an audio signal is sampled in the time domain to obtain a discretized representation of the audio signal that includes a sequence of samples. In step 404 an FIR filter is applied to the sequence of samples to emphasize high frequency components. In step 406 a window function is applied to successive subsets (frames) of the sequence of samples. In step 408 an FFT is applied to successive frames of samples to obtain a plurality of frequency components. In step 410 the plurality of frequency components are run through a MEL scale filter bank to obtain a plurality of MEL scale frequency components. In step 412 the magnitude of each MEL scale frequency component is taken to obtain a plurality of MEL frequency component magnitudes. In step 414 the log of each MEL frequency component magnitude is taken to obtain a plurality of log magnitude MEL scale frequency components. Referring to FIG. 5, which is a second part of the flow chart begun in FIG. 4, in step 502 a DCT is applied to the log magnitude MEL scale frequency components for each frame to obtain a cepstral coefficient vector for each frame. In step 504 first or higher order differences are taken between corresponding cepstral coefficients for two or more frames to obtain at least first order inter frame cepstral coefficient differences (deltas). In step 506, for each frame, the cepstral coefficients and the inter frame cepstral coefficient differences are output as a feature vector.

[0098] FIG. 6 is a hardware block diagram of the system 100 for performing automated speech recognition according to the preferred embodiment of the invention. As illustrated in FIG. 6, the system 100 is a processor 602 based system that executes programs 200, 300, 400 that are stored in a program memory 606. The program memory 606 is a form of computer readable medium. The processor 602, program memory 606, a workspace memory 604, e.g. Random Access Memory (RAM), and an input/output (I/O) interface 610 are coupled together through a digital signal bus 608. The I/O interface 610 is also coupled to an analog to digital converter (A/D) 612 and to a transcribed language output 614. The A/D 612 is coupled to the audio signal input 102 that preferably comprises a microphone. In operation, the audio signal is input at the audio signal input 102 and converted to the above mentioned discretized representation of the audio signal by the A/D 612, which operates under the control of the processor 602. The processor executes the programs described with reference to FIGS. 2-5 and outputs a stream of recognized sentences through the transcribed language output 614. Alternatively, the recognized words or sentences are used to control the operation of other programs executed by the processor. For example, the system 100 may comprise other peripheral devices such as a wireless phone transceiver (not shown), in which case the recognized words may be used to select a telephone number to be dialed automatically. The processor 602 preferably comprises a digital signal processor (DSP). Digital signal processors have instruction sets and architectures that are suitable for processing audio signals.

[0099] As will be apparent to those of ordinary skill in the pertinent arts, the invention may be implemented in hardware or software or a combination thereof. Programs embodying the invention or portions thereof may be stored on a variety of types of computer readable media including optical disks, hard disk drives, tapes, and programmable read only memory chips. Network circuits may also serve temporarily as computer readable media from which programs taught by the present invention are read.

[0100] While the preferred and other embodiments of the invention have been illustrated and described, it will be clear that the invention is not so limited. Numerous modifications, changes, variations, substitutions, and equivalents will occur to those of ordinary skill in the art without departing from the spirit and scope of the present invention as defined by the following claims.

We claim:
 1. A method of performing automatic speech recognition in a variable background noise environment, the method comprising the steps of: processing a first portion of an audio signal to obtain a first characterization of the first portion of the audio signal; comparing the first characterization to a set of reference characterizations to determine a particular reference characterization among the set of reference characterizations that most closely matches the first characterization; and updating the particular reference characterization so that the particular reference characterization more closely resembles the first characterization.
 2. The method according to claim 1 further comprising the step of: detecting an inter sentence pause; and in response to the step of detecting, performing the step of processing the first portion of the audio signal wherein the first portion of the audio signal is included in the inter sentence pause.
 3. The method according to claim 2 wherein: the step of processing the first portion of the audio signal to obtain a first characterization includes a sub-step of: processing the first portion of the audio signal to obtain a first set of numbers that characterize the first portion of the audio signal; and the step of comparing the first characterization to a set of reference characterizations comprises the sub-steps of: comparing the first set of numbers to a plurality of reference sets of numbers to determine a particular set of reference numbers that most closely matches the first set of numbers.
 4. The method according to claim 3 wherein the step of updating the reference characterization comprises the sub-steps of: replacing each number in the particular set of numbers with a weighted average of the number and a corresponding number in the first set of numbers.
 5. The method according to claim 4 wherein the step of comparing the first characterization to a set of reference characterizations comprises the sub-steps of: taking a dot product between the first set of numbers and each of the plurality of reference sets of numbers.
 6. The method according to claim 5 wherein: the plurality of reference sets of numbers characterize a plurality of types of non speech audio.
 7. The method according to claim 6 wherein the plurality of reference sets of numbers are means of components of Gaussian mixtures that characterize the probability of an underlying state of a hidden Markov model of the audio signal, given the first set of numbers.
 8. The method according to claim 7 wherein the step of processing the first portion of the audio signal to obtain the first characterization of the first portion of the audio signal comprises the sub-steps of: a) time domain sampling the audio signal to obtain a discretized representation of the audio signal that includes a sequence of samples; b) time domain filtering the sequence of samples to obtain a filtered sequence of samples; c) applying a window function to successive subsets of the filtered sequence of samples to obtain a sequence of frames of windowed filtered samples; d) transforming each of the frames of windowed filtered samples to a frequency domain to obtain a plurality of frequency components; e) taking a plurality of weighted sums of the plurality of frequency components to obtain a plurality of bandpass filtered outputs; f) taking the log of the magnitude of each of the bandpass filtered outputs to obtain a plurality of log magnitude bandpass filtered outputs; and g) transforming the plurality of log magnitude bandpass filtered outputs to a time domain to obtain at least a subset of the first set of numbers.
 9. The method according to claim 8 wherein the step of processing the first portion of the audio signal to obtain the first characterization of the first portion of the audio signal further comprises the sub-steps of: repeating sub-steps (a) through (g) for two portions of the audio signal to obtain two sets of numbers; and taking the difference between corresponding numbers in the two sets of numbers to obtain at least a subset of the first set of numbers.
 10. An automated speech recognition system comprising: an audio signal input for inputting an audio signal that includes speech and background sounds; a feature extractor coupled to the audio signal input for receiving the audio signal and outputting characterizations of a sequence of segments of the audio signal; a model coupled to the feature extractor, wherein the model includes a plurality of states to which characterizations of the sequence of segments are applied for evaluating a posteriori probabilities that one or more of the plurality of states occurred; a search engine coupled to the model for finding one or more high probability sequences of the plurality of states of the model; a detector for detecting a specific state of the audio signal and outputting a predetermined signal when the specific state is detected; and a comparer and updater coupled to the detector for receiving the predetermined signal and in response thereto updating the model so that it more closely models one or more characterizations output by the feature extractor that correspond to the specific state.
 11. The automated speech recognition system according to claim 10 wherein: the feature extractor outputs characterizations for each of a succession of frames that include feature vectors that include cepstral coefficients; the model comprises a hidden Markov model that includes a plurality of emitting states and multi component Gaussian mixtures that give the a posteriori probability that a given feature vector is attributable to a given emitting state; the detector detects an absence of speech sounds by comparing a function of one or more cepstral coefficients to a threshold; and the comparer and updater determines a mean of a multi component Gaussian mixture associated with background sounds that is closest to a feature vector that characterizes the audio signal during the absence of speech sounds, and updates the mean so that it is closer to the feature vector that characterizes the audio signal during the absence of speech sounds.
 12. An automated speech recognition system comprising: an audio input for inputting an audio signal; an analog to digital converter coupled to the audio input for sampling the audio signal and outputting a discretized audio signal; and a microprocessor coupled to the analog to digital converter for receiving the discretized audio signal and executing a program for performing automated speech recognition, the program comprising programming instructions for: processing a first portion of an audio signal to obtain a first characterization of the first portion of the audio signal; comparing the first characterization to a set of reference characterizations to determine a particular reference characterization among the set of reference characterizations that most closely matches the first characterization; and updating the particular reference characterization so that the particular reference characterization more closely resembles the first characterization.
 13. A computer readable medium storing programming instructions for performing automatic speech recognition in a variable background noise environment, including programming instructions for: processing a first portion of an audio signal to obtain a first characterization of the first portion of the audio signal; comparing the first characterization to a set of reference characterizations to determine a particular reference characterization among the set of reference characterizations that most closely matches the first characterization; updating the particular reference characterization so that the particular reference characterization more closely resembles the first characterization; processing one or more additional portions of the audio signal to obtain one or more additional characterizations that characterize the one or more additional portions of the audio signal; and comparing the one or more additional characterizations to the set of reference characterizations to find reference characterizations among the set of reference characterizations that most closely match the one or more additional characterizations.
 14. The computer readable medium according to claim 13 further comprising programming instructions for: detecting an inter sentence pause; and in response to the step of detecting, performing the step of processing the first portion of the audio signal wherein the first portion of the audio signal is included in the inter sentence pause.
 15. The computer readable medium according to claim 14 wherein: the programming instructions for processing the first portion of the audio signal to obtain a first characterization include programming instructions for: processing the first portion of the audio signal to obtain a first set of numbers that characterize the first portion of the audio signal; and the programming instructions for comparing the first characterization to a set of reference characterizations comprise programming instructions for: comparing the first set of numbers to a plurality of reference sets of numbers to determine a particular set of reference numbers that most closely matches the first set of numbers.
 16. The computer readable medium according to claim 15 wherein the programming instructions for updating the reference characterization comprise programming instructions for: replacing each number in the particular set of numbers with a weighted average of the number and a corresponding number in the first set of numbers.
 17. The computer readable medium according to claim 16 wherein the programming instructions for comparing the first characterization to a set of reference characterizations comprise programming instructions for: taking a dot product between the first set of numbers and each of the plurality of reference sets of numbers.
 18. The computer readable medium according to claim 17 wherein: the plurality of reference sets of numbers characterize a plurality of types of non voiced audio.
 19. The computer readable medium according to claim 18 wherein: the plurality of reference sets of numbers are means of components of Gaussian mixtures that characterize the probability of an underlying state of a hidden Markov model of the audio signal, given the first set of numbers.
 20. The computer readable medium according to claim 19 wherein the programming instructions for processing the first portion of the audio signal to obtain the first characterization of the first portion of the audio signal comprise programming instructions for: a) time domain sampling the audio signal to obtain a discretized representation of the audio signal that includes a sequence of samples; b) time domain filtering the sequence of samples to obtain a filtered sequence of samples; c) applying a window function to successive subsets of the filtered sequence of samples to obtain a sequence of frames of windowed filtered samples; d) transforming each of the frames of windowed filtered samples to a frequency domain to obtain a plurality of frequency components; e) taking a plurality of weighted sums of the plurality of frequency components to obtain a plurality of bandpass filtered outputs; f) taking the log of the magnitude of each of the bandpass filtered outputs to obtain a plurality of log magnitude bandpass filtered outputs; and g) transforming the plurality of log magnitude bandpass filtered outputs to a time domain to obtain at least a subset of the first set of numbers.
 21. The computer readable medium according to claim 20 wherein the programming instructions for processing the first portion of the audio signal to obtain the first characterization of the first portion of the audio signal further comprise programming instructions for: applying programming instructions (a) through (g) to two portions of the audio signal to obtain two sets of numbers; and taking the difference between corresponding numbers in the two sets of numbers to obtain at least a subset of the first set of numbers.